Estonian Dependency Treebank: from Constraint Grammar tagset to Universal Dependencies
Abstract
This paper presents the first version of the Estonian Universal Dependencies Treebank, which has been semi-automatically derived from the Estonian Dependency Treebank and comprises ca. 400,000 words (ca. 30,000 sentences) representing the genres of fiction, newspapers and scientific writing. The article analyses the differences between the two annotation schemes and describes the procedure for converting to the Universal Dependencies format. The conversion was carried out with manually created Constraint Grammar transfer rules. Because the rules can consider unbounded context and combine lexical information with both flat and tree-structure features at the same time, the method has proved reliable and flexible enough to handle most transformations. The automatic conversion procedure achieved LAS 95.2%, UAS 96.3% and LA 98.4%. With punctuation marks excluded from the calculations, the figures were LAS 96.4%, UAS 97.7% and LA 98.2%. Still, the guidelines and methodology need refinement in order to re-annotate some syntactic phenomena, e.g. inter-clausal relations. Although the automatic rules usually make quite a good guess even in obscure conditions, some relations should be checked and annotated manually after the main conversion.
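For readers unfamiliar with the three scores quoted in the abstract, the following is a minimal sketch of how such figures are commonly computed, not the authors' evaluation script. It assumes the usual definitions: UAS is the share of tokens with the correct head, LA the share with the correct relation label, and LAS the share with both correct. The `Token` class, the `attachment_scores` function and the toy sentence are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One token of a dependency tree (hypothetical minimal representation)."""
    form: str
    upos: str    # universal part-of-speech tag, e.g. "PUNCT"
    head: int    # 1-based index of the head token, 0 for the root
    deprel: str  # dependency relation label


def attachment_scores(gold, pred, skip_punct=False):
    """Return LAS, UAS and LA as fractions for two aligned token lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must be token-aligned")
    # Optionally leave out tokens that are punctuation in the gold standard.
    pairs = [(g, p) for g, p in zip(gold, pred)
             if not (skip_punct and g.upos == "PUNCT")]
    if not pairs:
        raise ValueError("no tokens left to score")
    total = len(pairs)
    uas = sum(g.head == p.head for g, p in pairs) / total
    la = sum(g.deprel == p.deprel for g, p in pairs) / total
    las = sum(g.head == p.head and g.deprel == p.deprel
              for g, p in pairs) / total
    return {"LAS": las, "UAS": uas, "LA": la}


# Toy example: "Ta luges raamatut ." with one label error and one head error.
gold = [
    Token("Ta", "PRON", 2, "nsubj"),
    Token("luges", "VERB", 0, "root"),
    Token("raamatut", "NOUN", 2, "obj"),
    Token(".", "PUNCT", 2, "punct"),
]
pred = [
    Token("Ta", "PRON", 2, "nsubj"),
    Token("luges", "VERB", 0, "root"),
    Token("raamatut", "NOUN", 2, "nmod"),  # wrong label, correct head
    Token(".", "PUNCT", 3, "punct"),       # wrong head, correct label
]

with_punct = attachment_scores(gold, pred)
without_punct = attachment_scores(gold, pred, skip_punct=True)
```

In this toy case, excluding punctuation removes the only head error, so UAS rises while LA falls. The same effect can explain why a label score may be lower without punctuation than with it: punctuation tokens almost always receive the correct label.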
Similar resources
An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...
Universal Dependencies for Japanese
We present an attempt to port the international syntactic annotation scheme, Universal Dependencies, to the Japanese language in this paper. Since the Japanese syntactic structure is usually annotated on the basis of unique chunk-based dependencies, we first introduce word-based dependencies by using a word unit called the Short Unit Word, which usually corresponds to an entry in the lexicon Un...
Constraint Grammar-based conversion of Dependency Treebanks
This paper presents a new method for the conversion of one style of dependency treebanks into another, using contextual, Constraint Grammar-based transformation rules for both structural changes (attachment) and changes in syntactic-functional tags (edge labels). In particular, we address the conversion of traditional syntactic dependency annotation into the semantically motivated dependency ann...
Syntactically annotated corpora of Estonian
Syntactically annotated corpora are needed 1) to train and test parsers and various language technological products (grammar checkers, information retrievers and extractors, machine translators, etc.); 2) to check the agreement of existing linguistic theories with real language usage. The corpora can be annotated on different levels of depth. In shallow syntactically annotated corpora a syntact...
Enhancing PTB Universal Dependencies for Grammar-Based Surface Realization
Grammar-based surface realizers require inputs compatible with their reversible, constraint-based grammars, including a proper representation of unbounded dependencies and coordination. In this paper, we report on progress towards creating realizer inputs along the lines of those used in the first surface realization shared task that satisfy this requirement. To do so, we augment the Universal ...